Preface

For this problem set we will apply some of the approaches presented in ISLR for variable selection and model regularization to datasets that we have worked with previously. The goal is to see whether these more principled methods for model selection allow us to better understand relative variable importance, the variability of the models' predictive performance, and so on.

For the purposes of the preface we will use the algae dataset from the lecture to illustrate some of the concepts and approaches. To do something different here, we will model another outcome available in that dataset, AG2. The problems in the set will continue using the fund-raising dataset from the previous problem sets. The flow below closely follows the outline of Labs 6.5 and 6.6 in ISLR, and you are encouraged to refer to them for additional examples and details.

algaeRaw <- read.table ("coil.analysis.data.txt", header=F, sep =",", row.names =NULL, na.strings ="XXXXXXX")
colnames (algaeRaw)= c("season","size","velocity",paste0("C",1:8),paste0("AG",1:7))
algaeRaw[1:3,]
##   season   size velocity   C1   C2    C3    C4      C5      C6      C7
## 1 winter small_   medium 8.00  9.8 60.80 6.238 578.000 105.000 170.000
## 2 spring small_   medium 8.35  8.0 57.75 1.288 370.000 428.750 558.750
## 3 autumn small_   medium 8.10 11.4 40.02 5.330 346.667 125.667 187.057
##     C8 AG1  AG2 AG3 AG4  AG5 AG6 AG7
## 1 50.0 0.0  0.0 0.0 0.0 34.2 8.3 0.0
## 2  1.3 1.4  7.6 4.8 1.9  6.7 0.0 2.1
## 3 15.6 3.3 53.6 1.9 0.0  0.0 0.0 9.7
# remove cases with undefined values and three outliers:
algae <- na.omit(algaeRaw)
algae <- algae[algae[,"C4"]<max(algae[,"C4"],na.rm=TRUE) &
               algae[,"C3"]<max(algae[,"C3"],na.rm=TRUE) &
               algae[,"AG4"]<max(algae[,"AG4"],na.rm=TRUE),]
# log-transform selected attributes:
for ( iCol in 1:8 ) {
  if ( iCol > 2 ) {
    algae[,paste0("C",iCol)] <- log(algae[,paste0("C",iCol)])
  }
  if ( iCol < 8 ) {
    algae[,paste0("AG",iCol)] <- log(1+algae[,paste0("AG",iCol)])
  }
}
# we'll use AG2 as an outcome here:
algaeAG2 <- algae[,!colnames(algae)%in%paste0("AG",c(1,3:7))]
pairs(algaeAG2[,-(1:3)])

Selecting the best variable subset on the entire dataset

Assuming that we have read and pre-processed the algae data (omitted observations with undefined values, log-transformed where necessary and removed egregious outliers), let’s use regsubsets from library leaps to select optimal models with the number of terms ranging from one to all variables in the dataset, using each of the methods available for this function, and collect the corresponding model metrics. Notice that we override the default value of the nvmax argument; reflect on why we do that and why we use that specific value. Remember that the goal here is to evaluate models up to those that include every predictor available.
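As a sketch of that arithmetic on a small hypothetical data frame (the names df, season, x1 and x2 below are made up for illustration): model.matrix expands every k-level categorical attribute into k-1 dummy variables, so the number of candidate predictors that nvmax has to cover is the number of model-matrix columns minus the intercept, not the number of data frame columns.

```r
# Hypothetical toy data: one 3-level factor and two numeric predictors.
df <- data.frame(y = rnorm(6),
                 season = factor(rep(c("winter","spring","summer"), 2)),
                 x1 = rnorm(6),
                 x2 = rnorm(6))
# Drop the intercept column; what remains are the candidate predictors:
nPred <- ncol(model.matrix(y ~ ., df)) - 1
nPred  # 2 dummies for season + 2 numeric predictors = 4
```

For the algae data the same count is seven dummies (from season, size and velocity) plus eight continuous attributes, hence nvmax=15 below.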

summaryMetrics <- NULL
whichAll <- list()
for ( myMthd in c("exhaustive", "backward", "forward", "seqrep") ) {
  # nvmax=15: the three categorical attributes expand into seven dummy variables,
  # which together with the eight continuous attributes gives 15 candidate predictors:
  rsRes <- regsubsets(AG2~.,algaeAG2,method=myMthd,nvmax=15)
  summRes <- summary(rsRes)
  whichAll[[myMthd]] <- summRes$which
  for ( metricName in c("rsq","rss","adjr2","cp","bic") ) {
    summaryMetrics <- rbind(summaryMetrics,
      data.frame(method=myMthd,metric=metricName,
                nvars=1:length(summRes[[metricName]]),
                value=summRes[[metricName]]))
  }
}
ggplot(summaryMetrics,aes(x=nvars,y=value,shape=method,colour=method)) +
  geom_path() + geom_point() + facet_wrap(~metric,scales="free") +
  theme_bw() + theme(legend.position="top")

We can see that, except for sequential replacement, which a couple of times selected models far inferior to those chosen by the rest of the methods, and backward selection, which came up with models slightly worse than the rest for some attribute counts (nvars=5:6), all methods produced models of very comparable performance by every metric. Plotting variable membership for each of those models, as captured by the which attribute of the summary, demonstrates that the first four variables are the same regardless of the selection method (C8, C1, C3 and the sizemedium dummy), after which variable inclusion varies somewhat by method:

old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
for ( myMthd in names(whichAll) ) {
  image(1:nrow(whichAll[[myMthd]]),
        1:ncol(whichAll[[myMthd]]),
        whichAll[[myMthd]],xlab="N(vars)",ylab="",
        xaxt="n",yaxt="n",breaks=c(-0.5,0.5,1.5),
        col=c("white","gray"),main=myMthd)
  axis(1,1:nrow(whichAll[[myMthd]]),rownames(whichAll[[myMthd]]))
  axis(2,1:ncol(whichAll[[myMthd]]),colnames(whichAll[[myMthd]]),las=2)
}

par(old.par)

Using training and test data to select the best subset

Next, following Lab 6.5.3 in ISLR, we will split our data approximately evenly into training and test sets, select the best subset of variables on the training data, evaluate its performance on training and test data, and record which variables have been selected each time. First, to be able to use regsubsets output to make predictions, we follow ISLR and set up a predict function that can be applied to the output of regsubsets. Notice .regsubsets in its name: this is how, under the S3 OOP framework in R, methods are matched to the corresponding classes. Further down we will call it simply by passing regsubsets output to predict, which works because regsubsets returns an object of class regsubsets:

predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
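The S3 dispatch mechanism itself can be illustrated with a tiny standalone example (the generic area and the class circle below are made up for illustration):

```r
# A generic function delegates to a class-specific method via UseMethod():
area <- function(shape) UseMethod("area")
# Method for objects of class "circle"; matched by the ".circle" suffix:
area.circle <- function(shape) pi * shape$r^2
# Fallback for classes without a dedicated method:
area.default <- function(shape) stop("no area method for this class")

x <- structure(list(r = 2), class = "circle")
area(x)  # dispatches to area.circle, returning 4*pi
```

Calling predict(rsTrain, ...) below resolves to predict.regsubsets in exactly the same way.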

We are all set now to repeatedly draw training sets, choose the best set of variables on them by each of the four methods available in regsubsets, calculate test error on the remaining samples, and so on. To summarize variable selection over multiple training/test splits we will use a 3-dimensional array whichSum, whose third dimension corresponds to the four methods available in regsubsets. To split the data into training and test we will again use the sample function; those who are curious and are paying attention may want to reflect on the difference between how it is done below and how it is implemented in Ch. 6.5.3 of ISLR, and on the consequences of that difference. (Hint: consider how the sizes of the training and test datasets vary from one iteration to another in the two implementations.)
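A minimal sketch of the key property of the split used below (n is chosen arbitrarily): rep() lays out a fixed, near-equal number of TRUE and FALSE flags, and sample() merely permutes them, so the training set size is identical from one iteration to the next; drawing each flag independently would let it fluctuate.

```r
# rep() produces exactly ceiling(n/2) TRUEs and floor(n/2) FALSEs;
# sample() only shuffles their order, so the split size never varies:
n <- 11
bTrain <- sample(rep(c(TRUE, FALSE), length.out = n))
sum(bTrain)   # always 6 when n = 11
# In contrast, sample(c(TRUE,FALSE), n, replace=TRUE) draws each flag
# independently, so the number of TRUEs varies between runs.
```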

dfTmp <- NULL
whichSum <- array(0,dim=c(15,16,4),
  dimnames=list(NULL,colnames(model.matrix(AG2~.,algaeAG2)),
      c("exhaustive", "backward", "forward", "seqrep")))
# Split data into training and test 30 times:
nTries <- 30
for ( iTry in 1:nTries ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(algaeAG2)))
  # Try each method available in regsubsets
  # to select the best model of each size:
  for ( jSelect in c("exhaustive", "backward", "forward", "seqrep") ) {
    rsTrain <- regsubsets(AG2~.,algaeAG2[bTrain,],nvmax=15,method=jSelect)
    # Add up variable selections:
    whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
    # Calculate test error for each set of variables
    # using predict.regsubsets implemented above:
    for ( kVarSet in 1:15 ) {
      # make predictions:
      testPred <- predict(rsTrain,algaeAG2[!bTrain,],id=kVarSet)
      # calculate MSE:
      mseTest <- mean((testPred-algaeAG2[!bTrain,"AG2"])^2)
      # add to data.frame for future plotting:
      dfTmp <- rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
      mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/sum(bTrain)),trainTest=c("test","train")))
    }
  }
}
# plot MSEs by training/test, number of 
# variables and selection method:
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest)+theme_bw()

We can see a clear difference in the behavior of training and test error as the number of attributes in the model increases. The training error decreases steadily, even if, for larger numbers of predictors, the difference between median errors is small compared to the variability of the error across multiple training/test splits. The test error shows an obvious decrease upon addition of the second attribute, followed by a steady increase once three or more attributes are included. We can therefore conclude that models with three or more attributes are overfitting in this case.

By plotting the average fraction of times each variable is included in the best model of every size by each of the four methods (darker shades of gray indicate a fraction of inclusions closer to one), we can see that the selection of C1 and C8 is fairly consistent, while the selection of the remaining attributes varies across model sizes and selection methods:

old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
for ( myMthd in dimnames(whichSum)[[3]] ) {
  tmpWhich <- whichSum[,,myMthd] / nTries
  image(1:nrow(tmpWhich),1:ncol(tmpWhich),tmpWhich,
        xlab="N(vars)",ylab="",xaxt="n",yaxt="n",main=myMthd,
        breaks=c(-0.1,0.1,0.25,0.5,0.75,0.9,1.1),
        col=c("white","gray90","gray75","gray50","gray25","gray10"))
  axis(1,1:nrow(tmpWhich),rownames(tmpWhich))
  axis(2,1:ncol(tmpWhich),colnames(tmpWhich),las=2)
}

par(old.par)

Similar observations can be made using cross-validation instead of a single split of the dataset into training and test; that analysis is omitted here for brevity.
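If you want to try it, a cross-validation split can be generated along the same lines as the training/test flags above (this sketch is not part of the code in this document; k and n are arbitrary):

```r
# Assign each observation to one of k folds of near-equal size:
k <- 5
n <- 23
folds <- sample(rep(1:k, length.out = n))
table(folds)  # fold sizes differ by at most one
# Held-out observations for fold i are then those with folds == i.
```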

Ridge for variable selection

As explained in the lecture and ISLR Ch. 6.6, lasso and ridge regression can be performed by the glmnet function from library glmnet. Its argument alpha governs the form of the shrinkage penalty: alpha=0 corresponds to ridge regression and alpha=1 to lasso. The arguments to glmnet differ from those used for lm, for example, in that the matrix of predictors and the outcome must be specified separately. model.matrix is particularly helpful for constructing the matrix of predictors because it creates dummy variables for categorical predictors:
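Concretely, glmnet fits the elastic net objective, of which ridge and lasso are the two extremes:

\[
\min_{\beta_0,\beta}\;\frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-x_i^T\beta\right)^2
\;+\;\lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2+\alpha\lVert\beta\rVert_1\right]
\]

so alpha=0 keeps only the quadratic (ridge) penalty term and alpha=1 only the absolute-value (lasso) penalty term.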

# [,-1] drops the intercept column, which glmnet adds on its own:
x <- model.matrix(AG2~.,algaeAG2)[,-1]
head(algaeAG2)
##   season   size velocity   C1   C2       C3        C4       C5       C6
## 1 winter small_   medium 8.00  9.8 4.107590 1.8306596 6.359574 4.653960
## 2 spring small_   medium 8.35  8.0 4.056123 0.2530906 5.913503 6.060874
## 3 autumn small_   medium 8.10 11.4 3.689379 1.6733512 5.848365 4.833636
## 4 spring small_   medium 8.07  4.8 4.348522 0.8337783 4.586823 4.113853
## 5 autumn small_   medium 8.06  9.0 4.013677 2.3433431 5.454038 4.064263
## 6 winter small_   high__ 8.25 13.1 4.185860 2.2244073 6.063785 2.904165
##         C7        C8      AG2
## 1 5.135798 3.9120230 0.000000
## 2 6.325702 0.2623643 2.151762
## 3 5.231413 2.7472709 4.000034
## 4 4.932313 0.3364722 3.737670
## 5 4.580673 2.3513753 1.360977
## 6 4.037192 3.3463891 2.747271
# notice how it created dummy variables for categorical attributes
head(x)
##   seasonspring seasonsummer seasonwinter sizemedium sizesmall_
## 1            0            0            1          0          1
## 2            1            0            0          0          1
## 3            0            0            0          0          1
## 4            1            0            0          0          1
## 5            0            0            0          0          1
## 6            0            0            1          0          1
##   velocitylow___ velocitymedium   C1   C2       C3        C4       C5
## 1              0              1 8.00  9.8 4.107590 1.8306596 6.359574
## 2              0              1 8.35  8.0 4.056123 0.2530906 5.913503
## 3              0              1 8.10 11.4 3.689379 1.6733512 5.848365
## 4              0              1 8.07  4.8 4.348522 0.8337783 4.586823
## 5              0              1 8.06  9.0 4.013677 2.3433431 5.454038
## 6              0              0 8.25 13.1 4.185860 2.2244073 6.063785
##         C6       C7        C8
## 1 4.653960 5.135798 3.9120230
## 2 6.060874 6.325702 0.2623643
## 3 4.833636 5.231413 2.7472709
## 4 4.113853 4.932313 0.3364722
## 5 4.064263 4.580673 2.3513753
## 6 2.904165 4.037192 3.3463891
y <- algaeAG2[,"AG2"]
ridgeRes <- glmnet(x,y,alpha=0)
plot(ridgeRes)

Plotting the output of glmnet illustrates how the contribution of each predictor changes with the amount of shrinkage. In ridge regression every predictor retains a nonzero coefficient over the entire range of shrinkage levels: coefficients are shrunk toward zero but never set exactly to zero.

The output of cv.glmnet shows the averages and variabilities of the cross-validation MSE across different levels of regularization. The lambda.min field indicates the value of \(\lambda\) at which the lowest average MSE was achieved; lambda.1se is the larger \(\lambda\) (more regularization) whose MSE is one standard error (of the cross-validation estimate) above the minimum. The latter is often recommended on the grounds that it is less susceptible to overfitting. You may find it instructive to experiment with levels of lambda other than the defaults to understand the sensitivity of the cv.glmnet output to them. Depending on the value of its type argument, predict gives access to model predictions, coefficients, etc. at a given level of lambda:
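The "one standard error" rule can be sketched on made-up numbers (lambdas, cvm and cvsd below are hypothetical stand-ins for the fields of the same names in a cv.glmnet object):

```r
lambdas <- 10^seq(1, -2, length.out = 7)       # decreasing, as in glmnet
cvm  <- c(2.0, 1.6, 1.3, 1.2, 1.25, 1.4, 1.6)  # mean CV MSE per lambda
cvsd <- rep(0.15, 7)                           # its standard error
iMin <- which.min(cvm)
lambda.min <- lambdas[iMin]
# largest lambda whose mean CV error stays within one SE of the minimum:
lambda.1se <- max(lambdas[cvm <= cvm[iMin] + cvsd[iMin]])
c(lambda.min, lambda.1se)  # lambda.1se >= lambda.min
```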

cvRidgeRes <- cv.glmnet(x,y,alpha=0)
plot(cvRidgeRes)

cvRidgeRes$lambda.min
## [1] 0.8415975
cvRidgeRes$lambda.1se
## [1] 5.937304
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                            1
## (Intercept)    -3.4201892751
## seasonspring    0.0037729439
## seasonsummer   -0.0060035842
## seasonwinter   -0.0975006226
## sizemedium     -0.1480074656
## sizesmall_     -0.0593286201
## velocitylow___  0.1251026812
## velocitymedium  0.1729267669
## C1              0.4956087313
## C2             -0.0005789764
## C3              0.1063279515
## C4             -0.0087399611
## C5             -0.0163725208
## C6              0.0739619298
## C7              0.0308752691
## C8              0.1288509913
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                           1
## (Intercept)    -0.487007048
## seasonspring    0.006974426
## seasonsummer   -0.010993747
## seasonwinter   -0.022952662
## sizemedium     -0.027971986
## sizesmall_     -0.052221994
## velocitylow___  0.073866961
## velocitymedium  0.063881170
## C1              0.176492052
## C2             -0.005964469
## C3              0.047725915
## C4              0.003549673
## C5              0.005676802
## C6              0.037694271
## C7              0.033782329
## C8              0.051511746
# and with lambda's other than default:
cvRidgeRes <- cv.glmnet(x,y,alpha=0,lambda=10^((-80:80)/20))
plot(cvRidgeRes)

Similarly to what was observed for the variable selection methods above, the plot of cross-validation error for ridge regression has a well-defined minimum, indicating that some amount of regularization is necessary to keep the model using all attributes from overfitting. Notice that the minimum \(MSE\simeq 1.25\) from ridge regression here is very comparable to the minimum average test error observed above when variables were selected by regsubsets.

The relatively higher contributions of C1 and C8 to the model outcome are more apparent when ridge regression is performed on a centered and, more importantly, scaled matrix of predictors:

ridgeResScaled <- glmnet(scale(x),y,alpha=0)
plot(ridgeResScaled)

cvRidgeResScaled <- cv.glmnet(scale(x),y,alpha=0)
plot(cvRidgeResScaled)

predict(ridgeResScaled,type="coefficients",s=cvRidgeResScaled$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                           1
## (Intercept)     1.453178693
## seasonspring    0.002582750
## seasonsummer   -0.004166460
## seasonwinter   -0.008142440
## sizemedium     -0.009666640
## sizesmall_     -0.021408277
## velocitylow___  0.023799817
## velocitymedium  0.025891807
## C1              0.067517518
## C2             -0.013170065
## C3              0.045265216
## C4              0.003686839
## C5              0.008349337
## C6              0.040914371
## C7              0.033586506
## C8              0.060896925

Lasso for variable selection

Lasso regression is performed by the same call to glmnet, except that now alpha=1. One can see how more and more coefficients are set exactly to zero as the amount of shrinkage increases. Notice that the amount of regularization increases from right to left when plotting the output of glmnet, and from left to right when plotting the output of cv.glmnet.

lassoRes <- glmnet(x,y,alpha=1)
plot(lassoRes)

cvLassoRes <- cv.glmnet(x,y,alpha=1)
plot(cvLassoRes)

# With other than default levels of lambda:
cvLassoRes <- cv.glmnet(x,y,alpha=1,lambda=10^((-120:0)/20))
plot(cvLassoRes)

The cross-validation MSE for lasso regularization also has a well-defined minimum.

predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                          1
## (Intercept)    -2.58075850
## seasonspring    .         
## seasonsummer    .         
## seasonwinter    .         
## sizemedium      .         
## sizesmall_      .         
## velocitylow___  .         
## velocitymedium  .         
## C1              0.45361507
## C2              .         
## C3              0.04081512
## C4              .         
## C5              .         
## C6              .         
## C7              .         
## C8              0.13480989
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.min)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                          1
## (Intercept)    -5.01052532
## seasonspring    .         
## seasonsummer    .         
## seasonwinter   -0.01998932
## sizemedium     -0.04831254
## sizesmall_      .         
## velocitylow___  .         
## velocitymedium  0.08367004
## C1              0.69924267
## C2              .         
## C3              0.10189251
## C4              .         
## C5              .         
## C6              0.04943100
## C7              .         
## C8              0.16728910

As explained above and illustrated in the cv.glmnet plots, lambda.1se typically corresponds to more shrinkage, with more coefficients set to zero by lasso. Using the scaled predictor matrix makes the contributions of C1 and C8 (and, to a smaller degree, C3) more apparent:

lassoResScaled <- glmnet(scale(x),y,alpha=1)
plot(lassoResScaled)

cvLassoResScaled <- cv.glmnet(scale(x),y,alpha=1)
plot(cvLassoResScaled)

predict(lassoResScaled,type="coefficients",s=cvLassoResScaled$lambda.1se)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)    1.45317869
## seasonspring   .         
## seasonsummer   .         
## seasonwinter   .         
## sizemedium     .         
## sizesmall_     .         
## velocitylow___ .         
## velocitymedium .         
## C1             0.20081746
## C2             .         
## C3             0.03376726
## C4             .         
## C5             .         
## C6             .         
## C7             .         
## C8             0.18722644

Lasso on train/test datasets

Lastly, we can run lasso on several training datasets, calculate the corresponding test MSE, and count how often each coefficient is included in the resulting models:

lassoCoefCnt <- 0
lassoMSE <- NULL
for ( iTry in 1:30 ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(x)))
  cvLassoTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=1)
  lassoTrain <- glmnet(x[bTrain,],y[bTrain],alpha=1)
  lassoTrainCoef <- predict(lassoTrain,type="coefficients",s=cvLassoTrain$lambda.1se)
  lassoCoefCnt <- lassoCoefCnt + (lassoTrainCoef[-1,1]!=0)
  lassoTestPred <- predict(lassoTrain,newx=x[!bTrain,],s=cvLassoTrain$lambda.1se)
  lassoMSE <- c(lassoMSE,mean((lassoTestPred-y[!bTrain])^2))
}
mean(lassoMSE)
## [1] 1.33928
lassoCoefCnt
##   seasonspring   seasonsummer   seasonwinter     sizemedium     sizesmall_ 
##              0              0              0              0              0 
## velocitylow___ velocitymedium             C1             C2             C3 
##              0              1             28              0             11 
##             C4             C5             C6             C7             C8 
##              0              0              6              0             27

One can conclude that a typical lasso model includes two, sometimes three, coefficients, and (by comparison with the plots above) that its test MSE is about what was observed for two- to three-variable models chosen by the best subset selection approach.

Problem 1: the best subset selection (15 points)

Using the fund-raising dataset from the week 4 problem set (properly preprocessed: shifted/log-transformed, with the predictions supplied with the data excluded), select the best subsets of variables for predicting contrib by the methods available in regsubsets. Plot the corresponding model metrics (rsq, rss, etc.) and discuss the results presented in these plots (e.g. what number of variables appears to be optimal by different metrics) and which variables are included in models of which sizes (e.g. are there variables that are included more often than others?).

It is up to you whether to include the gender attribute in your analysis. It is a categorical attribute and as such has to be correctly represented by dummy variable(s). If you do that properly (and the preface above supplies abundant examples of doing so), you will get three extra points for each problem in this week's problem set that (correctly!) includes gender in the analysis, for a possible total extra of 3x4=12 points. If you prefer not to add this extra work, then excluding gender from the data at the outset (as you were instructed to do for week 4) is probably the cleanest way to keep it out of your computations, limiting the total you can earn for this problem set to the standard 60 points.

fundRaisingData <- read.table("fund-raising-with-predictions.csv", sep=",", header=TRUE)

summaryMetrics <- NULL
whichAll <- list()
for ( myMthd in c("exhaustive", "backward", "forward", "seqrep") ) {
  # nvmax=12 covers all available predictors (gender is excluded, so there are
  # no dummy variables here):
  rsRes <- regsubsets(contrib~., fundRaisingData[, 1:12], method=myMthd, nvmax=12)
  summRes <- summary(rsRes)
  whichAll[[myMthd]] <- summRes$which
  for ( metricName in c("rsq", "rss", "adjr2", "cp", "bic") ) {
    summaryMetrics <- rbind(summaryMetrics, 
      data.frame(method=myMthd, metric=metricName, 
                nvars=1:length(summRes[[metricName]]), 
                value=summRes[[metricName]]))
  }
}
ggplot(summaryMetrics, aes(x=nvars, y=value, shape=method, colour=method)) +
  geom_path() + geom_point() + facet_wrap(~metric, scales="free") +
  theme_bw() + theme(legend.position="top")

Four variables appear to be optimal: all methods show a valley (or a peak, depending on the metric) at nvars = 4.

old.par <- par(mfrow=c(2, 2), ps=16, mar=c(5, 7, 2, 1))
for ( myMthd in names(whichAll) ) {
  image(1:nrow(whichAll[[myMthd]]), 
        1:ncol(whichAll[[myMthd]]), 
        whichAll[[myMthd]], xlab="N(vars)", ylab="", 
        xaxt="n", yaxt="n", breaks=c(-0.5, 0.5, 1.5), 
        col=c("white", "gray"), main=myMthd)
  axis(1, 1:nrow(whichAll[[myMthd]]), rownames(whichAll[[myMthd]]))
  axis(2, 1:ncol(whichAll[[myMthd]]), colnames(whichAll[[myMthd]]), las=2)
}

par(old.par)

Sliding a vertical line from left to right across the gray-block summary, the first four variables to be included are mindate, avecontr, lastcontr and ncontrib.

Problem 2: the best subset on training/test data (15 points)

Splitting the fund-raising dataset into training and test sets as shown above, please calculate and plot training and test errors (MSE) for each model size for the methods available in regsubsets. Using the which field, investigate the stability of variable selection at each model size across multiple selections of training/test data. Discuss these results: e.g. what model size appears to be the most useful by this approach, what is the error corresponding to it, how stable is this conclusion across the methods for best subset selection, and how does this error compare to that of the predictions provided with the data (the predcontr attribute)?

predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
dfTmp <- NULL
# 11 numeric predictors (columns 2:12 of the data frame), so summary(...)$which
# has 11 rows and 12 columns (intercept included):
whichSum2 <- array(0, dim=c(11,12,4),
  dimnames=list(NULL, colnames(model.matrix(contrib~., fundRaisingData[,1:12])),
      c("exhaustive", "backward", "forward", "seqrep")))
# Split data into training and test 30 times:
nTries <- 30
for ( iTry in 1:nTries ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(fundRaisingData)))
  # Try each method available in regsubsets
  # to select the best model of each size:
  for ( jSelect in c("exhaustive", "backward", "forward", "seqrep") ) {
    # restrict to columns 1:12 so that predcontr is not used as a predictor:
    rsTrain2 <- regsubsets(contrib~.,fundRaisingData[bTrain,1:12],nvmax=11,method=jSelect)
    # Add up variable selections:
    whichSum2[,,jSelect] <- whichSum2[,,jSelect] + summary(rsTrain2)$which
    # Calculate test error for each set of variables
    # using predict.regsubsets implemented above:
    for ( kVarSet in 1:11 ) {
      # make predictions:
      testPred <- predict(rsTrain2,fundRaisingData[!bTrain,1:12],id=kVarSet)
      # calculate MSE:
      mseTest <- mean((testPred-fundRaisingData[!bTrain,"contrib"])^2)
      # add to data.frame for future plotting:
      dfTmp <- rbind(dfTmp,data.frame(sim=iTry, sel=jSelect, vars=kVarSet,
          mse=c(mseTest, summary(rsTrain2)$rss[kVarSet] / sum(bTrain)), trainTest=c("test","train")))
    }
  }
}

# plot MSEs by training/test, number of 
# variables and selection method:
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot() +
  # trans_breaks, trans_format and math_format require library(scales):
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  facet_wrap(~trainTest) + theme_bw()

Exhaustive, backward and forward selection show a discernible dip in test MSE around three variables, while the minimum for seqrep is harder to pin down.

Problem 3: lasso regression (15 points)

Fit lasso regression model of the outcome in fund raising dataset. Plot and discuss glmnet and cv.glmnet results. Compare coefficient values at cross-validation minimum MSE and that 1SE away from it – which coefficients are set to zero? Experiment with different ranges of lambda passed to cv.glmnet and discuss the results.

# columns 2:12 keep the 11 predictors, dropping both the intercept
# (which glmnet adds on its own) and the supplied predictions (predcontr):
x <- model.matrix(contrib~.,fundRaisingData)[,2:12]
head(fundRaisingData[,2:12])
##   gapmos promocontr mincontrib ncontrib maxcontrib lastcontr  avecontr
## 1     12         10          2       15          7         5  4.066667
## 2      3         14          3       21          6         5  4.857143
## 3     21          5          5       12         17        10 11.000000
## 4      6          8          5       10         12        12  9.400000
## 5      7          2         10        3         15        10 11.666667
## 6      3         16          5       26         15         7  9.576923
##   mailord mindate maxdate age
## 1      10    8801    9404  62
## 2       5    9312    9404  66
## 3       0    9001    9503  69
## 4      10    9209    9509  73
## 5       0    9511    9508  58
## 6      15    9505    8709  85
head(x)
##   gapmos promocontr mincontrib ncontrib maxcontrib lastcontr  avecontr
## 1     12         10          2       15          7         5  4.066667
## 2      3         14          3       21          6         5  4.857143
## 3     21          5          5       12         17        10 11.000000
## 4      6          8          5       10         12        12  9.400000
## 5      7          2         10        3         15        10 11.666667
## 6      3         16          5       26         15         7  9.576923
##   mailord mindate maxdate age
## 1      10    8801    9404  62
## 2       5    9312    9404  66
## 3       0    9001    9503  69
## 4      10    9209    9509  73
## 5       0    9511    9508  58
## 6      15    9505    8709  85
y <- fundRaisingData[,"contrib"]
ridgeRes <- glmnet(x,y,alpha=0)
plot(ridgeRes)

cvRidgeRes <- cv.glmnet(x,y,alpha=0)
plot(cvRidgeRes)

cvRidgeRes$lambda.min
## [1] 0.862611
cvRidgeRes$lambda.1se
## [1] 9.689916
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  2.013600446
## gapmos       0.031953319
## promocontr  -0.043002807
## mincontrib   0.017568850
## ncontrib    -0.042676023
## maxcontrib   0.016476120
## lastcontr    0.528638141
## avecontr     0.383756359
## mailord     -0.011785932
## mindate     -0.001569336
## maxdate      0.001646156
## age          0.005576001
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  5.487909642
## gapmos       0.037500235
## promocontr  -0.066120771
## mincontrib   0.169119699
## ncontrib    -0.041707254
## maxcontrib   0.045260882
## lastcontr    0.295239706
## avecontr     0.311040885
## mailord     -0.001018152
## mindate     -0.001849530
## maxdate      0.001888851
## age          0.001862567
# and with lambda's other than default:
cvRidgeRes <- cv.glmnet(x,y,alpha=0,lambda=10^((-80:80)/20))
plot(cvRidgeRes)

ridgeResScaled <- glmnet(scale(x),y,alpha=0)
plot(ridgeResScaled)

cvRidgeResScaled <- cv.glmnet(scale(x),y,alpha=0)
plot(cvRidgeResScaled)

predict(ridgeResScaled,type="coefficients",s=cvRidgeResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 15.294593660
## gapmos       0.223016309
## promocontr  -0.324650608
## mincontrib   0.766011218
## ncontrib    -0.355667870
## maxcontrib   0.922269011
## lastcontr    2.386494509
## avecontr     1.757833844
## mailord      0.011716125
## mindate     -0.370756528
## maxdate      0.304517358
## age          0.007869847
lassoRes <- glmnet(x,y,alpha=1)
plot(lassoRes)

cvLassoRes <- cv.glmnet(x,y,alpha=1)
plot(cvLassoRes)

# With other than default levels of lambda:
cvLassoRes <- cv.glmnet(x,y,alpha=1,lambda=10^((-120:0)/20))
plot(cvLassoRes)

predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept) 3.7535566
## gapmos      .        
## promocontr  .        
## mincontrib  .        
## ncontrib    .        
## maxcontrib  .        
## lastcontr   0.5631689
## avecontr    0.2870962
## mailord     .        
## mindate     .        
## maxdate     .        
## age         .
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)  3.09310288
## gapmos       .         
## promocontr   .         
## mincontrib   .         
## ncontrib    -0.01684874
## maxcontrib   .         
## lastcontr    0.59238921
## avecontr     0.32613837
## mailord      .         
## mindate      .         
## maxdate      .         
## age          .
lassoResScaled <- glmnet(scale(x),y,alpha=1)
plot(lassoResScaled)

cvLassoResScaled <- cv.glmnet(scale(x),y,alpha=1)
plot(cvLassoResScaled)

predict(lassoResScaled,type="coefficients",s=cvLassoResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept) 15.294594
## gapmos       .       
## promocontr   .       
## mincontrib   .       
## ncontrib     .       
## maxcontrib   .       
## lastcontr    5.348660
## avecontr     1.218948
## mailord      .       
## mindate      .       
## maxdate      .       
## age          .

avecontr and lastcontr are the predictors that lasso retains as non-zero at lambda.1se; at lambda.min, ncontrib also enters with a small negative coefficient.

Problem 4: lasso in resampling (15 points)

Similarly to the example shown in the Preface above, use resampling to estimate the test error of lasso models fit to training data and the stability of variable selection by lasso across different splits of the data into training and test. Use a resampling approach of your choice. Compare the typical model size to that obtained by best subset selection above. Compare the test error observed here to that of the predictions supplied with the data (predcontr) and to the models fit above, and discuss the results.

lassoCoefCnt <- 0
lassoMSE <- NULL
for ( iTry in 1:30 ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(x)))
  cvLassoTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=1)
  lassoTrain <- glmnet(x[bTrain,],y[bTrain],alpha=1)
  lassoTrainCoef <- predict(lassoTrain,type="coefficients",s=cvLassoTrain$lambda.1se)
  lassoCoefCnt <- lassoCoefCnt + (lassoTrainCoef[-1,1]!=0)
  lassoTestPred <- predict(lassoTrain,newx=x[!bTrain,],s=cvLassoTrain$lambda.1se)
  lassoMSE <- c(lassoMSE,mean((lassoTestPred-y[!bTrain])^2))
}
mean(lassoMSE)
## [1] 69.49436
lassoCoefCnt
##     gapmos promocontr mincontrib   ncontrib maxcontrib  lastcontr 
##          0          0          0          0          5         30 
##   avecontr    mailord    mindate    maxdate        age 
##         30          0          0          0          0